
    Thresholds of descending algorithms in inference problems

    We review recent works on analyzing the dynamics of gradient-based algorithms in a prototypical statistical inference problem. Using methods and insights from the physics of glassy systems, these works showed how to understand, quantitatively and qualitatively, the performance of gradient-based algorithms. Here we review the key results and their interpretation, in non-technical terms accessible to a wide audience of physicists, in the context of related works. Comment: 8 pages, 4 figures.

    Optimization and Generalization of Shallow Neural Networks with Quadratic Activation Functions

    We study the dynamics of optimization and the generalization properties of one-hidden-layer neural networks with quadratic activation function in the over-parametrized regime, where the layer width $m$ is larger than the input dimension $d$. We consider a teacher-student scenario in which the teacher has the same structure as the student, with a hidden layer of smaller width $m^* \le m$. We describe how the empirical loss landscape is affected by the number $n$ of data samples and by the width $m^*$ of the teacher network. In particular, we determine how the probability that there are no spurious minima on the empirical loss depends on $n$, $d$, and $m^*$, thereby establishing conditions under which the neural network can in principle recover the teacher. We also show that under the same conditions gradient descent dynamics on the empirical loss converges and leads to small generalization error, i.e. it enables recovery in practice. Finally, we characterize the time-convergence rate of gradient descent in the limit of a large number of samples. These results are confirmed by numerical experiments. Comment: 10 pages, 4 figures + appendix.
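    A minimal sketch of this teacher-student setup may help fix ideas. Everything below is an illustrative assumption rather than the paper's code: the sizes, the learning rate, the unit second-layer weights, and plain full-batch gradient descent in NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, m_star, n = 20, 40, 3, 2000          # over-parametrized: m > d, small teacher

def predict(W, X):
    # one hidden layer, quadratic activation: f(x) = sum_i (w_i . x)^2
    return ((X @ W.T) ** 2).sum(axis=1)

W_teacher = rng.standard_normal((m_star, d)) / np.sqrt(d)
X = rng.standard_normal((n, d))
y = predict(W_teacher, X)

W = rng.standard_normal((m, d)) / np.sqrt(d * m)      # small random student init
lr = 0.02
for _ in range(5000):
    err = predict(W, X) - y                           # residuals, shape (n,)
    grad = 4 * (err[:, None] * (X @ W.T)).T @ X / n   # gradient of the mean squared loss
    W -= lr * grad

X_test = rng.standard_normal((1000, d))
print("train MSE:", np.mean((predict(W, X) - y) ** 2))
print("test MSE :", np.mean((predict(W, X_test) - predict(W_teacher, X_test)) ** 2))
```

    In the recovery regime the abstract describes, one expects runs like this to drive both the train and the test error close to zero.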

    Who is afraid of big bad minima? Analysis of gradient-flow in spiked matrix-tensor models

    Gradient-based algorithms are effective for many machine learning tasks, but despite ample recent effort and some progress, it often remains unclear why they work in practice when optimising high-dimensional non-convex functions, and why they find good minima instead of being trapped in spurious ones. Here we present a quantitative theory explaining this behaviour in a spiked matrix-tensor model. Our framework is based on the Kac-Rice analysis of stationary points and a closed-form analysis of gradient-flow originating from statistical physics. We show that there is a well-defined region of parameters where the gradient-flow algorithm finds a good global minimum despite the presence of exponentially many spurious local minima. We show that this is achieved by surfing on saddles that have a strong negative direction towards the global minima, a phenomenon that is connected to a BBP-type threshold in the Hessian describing the critical points of the landscape.
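    The gradient-flow experiment is easy to mimic at small size. The sketch below is not the paper's setup: the normalisations of the matrix and tensor channels, the noise levels, and the Euler discretisation of the spherical gradient flow are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, dt, steps = 100, 0.01, 3000
delta2, delta3 = 0.25, 1.0          # illustrative noise levels of the two channels

x_star = rng.standard_normal(N)
x_star *= np.sqrt(N) / np.linalg.norm(x_star)     # planted spike on the sphere

# noisy rank-one matrix and order-3 tensor observations of the spike
Y = np.outer(x_star, x_star) / np.sqrt(N) + np.sqrt(delta2) * rng.standard_normal((N, N))
Y = (Y + Y.T) / 2
T = (np.einsum('i,j,k->ijk', x_star, x_star, x_star) / N
     + np.sqrt(delta3) * rng.standard_normal((N, N, N)))

def grad(x):
    # gradient of an illustrative loss combining the two channels
    g2 = -Y @ x / (delta2 * np.sqrt(N))
    g3 = -3 * np.einsum('ijk,j,k->i', T, x, x) / (2 * delta3 * N)
    return g2 + g3

x = rng.standard_normal(N)
x *= np.sqrt(N) / np.linalg.norm(x)               # random init on the sphere
for _ in range(steps):
    g = grad(x)
    g -= (g @ x) * x / N                          # project onto the tangent space
    x -= dt * g
    x *= np.sqrt(N) / np.linalg.norm(x)           # stay on the sphere |x|^2 = N

print("overlap with the signal:", abs(x @ x_star) / N)
```

    Varying delta2 and delta3 should move the run between the trapped phase and the region where gradient flow surfs the saddles towards the global minimum.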

    The RL Perceptron: Generalisation Dynamics of Policy Learning in High Dimensions

    Reinforcement learning (RL) algorithms have proven transformative in a range of domains. To tackle real-world domains, these systems often use neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, much of the theory of RL has focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional model of RL that can capture a variety of learning protocols, and derive its typical dynamics as a set of closed-form ordinary differential equations (ODEs). We derive optimal schedules for the learning rates and task difficulty - analogous to annealing schemes and curricula during training in RL - and show that the model exhibits rich behaviour, including delayed learning under sparse rewards, a variety of learning regimes depending on reward baselines, and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game "Bossfight" and the Arcade Learning Environment game "Pong" also show such a speed-accuracy trade-off in practice. Together, these results take a step towards closing the gap between theory and practice in high-dimensional RL. Comment: 10 pages, 6 figures, Preprint.
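    As a toy illustration of such a model, one can simulate a perceptron policy trained by a reward-modulated update against a teacher perceptron. This is only a guess at the flavour of the setup: the reward baseline, learning rate, and update rule below are assumptions, not the paper's protocol.

```python
import numpy as np

rng = np.random.default_rng(2)
d, steps, lr = 500, 20000, 0.05

w_star = rng.standard_normal(d)            # the "environment": a teacher perceptron
w = 0.01 * rng.standard_normal(d)          # student policy weights

for _ in range(steps):
    x = rng.standard_normal(d) / np.sqrt(d)
    a = np.sign(w @ x)                     # greedy action of the current policy
    r = 1.0 if a == np.sign(w_star @ x) else -0.1   # reward with a negative baseline
    w += lr * r * a * x                    # reward-modulated Hebbian update

overlap = w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star))
print("policy-teacher overlap:", overlap)
```

    Tracking the overlap over training is the finite-size analogue of the ODE description: in high dimension such order parameters concentrate on a deterministic trajectory.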

    Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval

    Despite the widespread use of gradient-based algorithms for optimizing high-dimensional non-convex functions, understanding their ability to find good minima instead of being trapped in spurious ones remains, to a large extent, an open problem. Here we focus on gradient-flow dynamics for phase retrieval from random measurements. When the ratio of the number of measurements to the input dimension is small, the dynamics remains trapped in spurious minima with large basins of attraction. We find analytically that above a critical ratio those critical points become unstable, developing a negative direction toward the signal. By numerical experiments we show that in this regime the gradient-flow algorithm is not trapped; it drifts away from the spurious critical points along the unstable direction and succeeds in finding the global minimum. Using tools from statistical physics we characterize this phenomenon, which is related to a BBP-type transition in the Hessian of the spurious minima.
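    The phase-retrieval experiment is straightforward to reproduce at small scale. In this sketch the sampling ratio alpha, the step size, and the number of steps are illustrative, and gradient flow is approximated by plain discrete gradient descent from a random start (no spectral initialization).

```python
import numpy as np

rng = np.random.default_rng(3)
d = 100
alpha = 8.0                                  # measurements per input dimension
n = int(alpha * d)

x_star = rng.standard_normal(d)              # hidden signal
A = rng.standard_normal((n, d)) / np.sqrt(d) # random sensing vectors
y = (A @ x_star) ** 2                        # phaseless measurements

def loss_and_grad(w):
    z = A @ w
    res = z ** 2 - y
    return np.mean(res ** 2), 4 * A.T @ (res * z) / n

w = rng.standard_normal(d)                   # random initialization
dt = 2.0
for _ in range(5000):
    L, g = loss_and_grad(w)
    w -= dt * g

m = abs(w @ x_star) / (np.linalg.norm(w) * np.linalg.norm(x_star))
print(f"final loss {L:.3e}, |overlap| {m:.3f}")
```

    Lowering alpha should reproduce the trapped regime of the abstract, while at larger ratios the descent escapes along the unstable direction and the overlap approaches 1.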

    Epidemic mitigation by statistical inference from contact tracing data

    Contact tracing is an essential tool to mitigate the impact of pandemics such as COVID-19. In order to achieve efficient and scalable contact tracing in real time, digital devices can play an important role. While a lot of attention has been paid to analyzing the privacy and ethical risks of the associated mobile applications, so far much less research has been devoted to optimizing their performance and assessing their impact on the mitigation of the epidemic. We develop Bayesian inference methods to estimate the risk that an individual is infected. This inference is based on the list of their recent contacts and the contacts' own risk levels, as well as personal information such as test results or the presence of symptoms. We propose to use probabilistic risk estimation in order to optimize testing and quarantining strategies for the control of an epidemic. Our results show that in some range of epidemic spreading (typically when the manual tracing of all contacts of infected people becomes practically impossible, but before the fraction of infected people reaches the scale where a lock-down becomes unavoidable), this inference of individuals at risk could be an efficient way to mitigate the epidemic. Our approaches translate into fully distributed algorithms that only require communication between individuals who have recently been in contact. Such communication may be encrypted and anonymized, and is thus compatible with privacy-preserving standards. We conclude that probabilistic risk estimation can enhance the performance of digital contact tracing and should be considered in the mobile applications currently under development. Comment: 21 pages, 7 figures.
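    The distributed flavour of such an algorithm can be sketched in a few lines. This is not the paper's Bayesian message-passing method; it is a simplified risk-propagation heuristic, with the transmission probability, damping, prior, and contact graph all chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n_people, p_transmit, damping = 200, 0.10, 0.5

# random recent-contact graph
contacts = [set() for _ in range(n_people)]
for _ in range(400):
    i, j = rng.integers(n_people, size=2)
    if i != j:
        contacts[i].add(j)
        contacts[j].add(i)

risk = np.full(n_people, 0.01)                  # prior probability of infection
confirmed = set(rng.choice(n_people, size=5, replace=False).tolist())
for i in confirmed:
    risk[i] = 1.0                               # a positive test pins the risk to 1

for _ in range(20):                             # iterate until the risks stabilize
    new_risk = risk.copy()
    for i in range(n_people):
        if i in confirmed:
            continue
        # probability that no infected contact transmitted to i
        p_safe = np.prod([1 - p_transmit * risk[j] for j in contacts[i]])
        new_risk[i] = damping * risk[i] + (1 - damping) * (1 - (1 - risk[i]) * p_safe)
    risk = new_risk

print("highest-risk individuals:", np.argsort(risk)[-10:])   # candidates for testing
```

    Each update for individual i uses only the risks of i's own contacts, which is what makes a fully distributed, privacy-compatible implementation possible.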

    Analytical Study of Momentum-Based Acceleration Methods in Paradigmatic High-Dimensional Non-Convex Problems

    The optimization step in many machine learning problems rarely relies on vanilla gradient descent; instead it is common practice to use momentum-based accelerated methods. Despite these algorithms being widely applied to arbitrary loss functions, their behaviour in generic non-convex, high-dimensional landscapes is poorly understood. In this work, we use dynamical mean-field theory techniques to describe analytically the average dynamics of these methods in a prototypical non-convex model: the (spiked) matrix-tensor model. We derive a closed set of equations that describe the behaviour of heavy-ball momentum and Nesterov acceleration in the infinite-dimensional limit. By numerical integration of these equations, we observe that these methods speed up the dynamics but do not improve the algorithmic threshold with respect to gradient descent in the spiked model. Comment: To appear in NeurIPS 202
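    For reference, the two update rules the analysis covers can be written down in a few lines. The toy loss below (a random symmetric quadratic plus a quartic confinement) merely stands in for the spiked matrix-tensor landscape; the learning rate and momentum parameter are illustrative.

```python
import numpy as np

def heavy_ball(grad, x0, lr=5e-3, beta=0.9, steps=2000):
    # Polyak heavy-ball: the velocity accumulates a moving average of past gradients
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        v = beta * v - lr * grad(x)
        x = x + v
    return x

def nesterov(grad, x0, lr=5e-3, beta=0.9, steps=2000):
    # Nesterov acceleration: the gradient is evaluated at a look-ahead point
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(steps):
        v = beta * v - lr * grad(x + beta * v)
        x = x + v
    return x

rng = np.random.default_rng(5)
A = rng.standard_normal((50, 50))
A = (A + A.T) / 2                                 # random symmetric "landscape"
grad = lambda x: A @ x + 0.04 * (x @ x) * x       # grad of x.A.x/2 + 0.01*|x|^4
x0 = rng.standard_normal(50)

for name, opt in [("heavy-ball", heavy_ball), ("nesterov", nesterov)]:
    x = opt(grad, x0)
    print(name, "loss:", 0.5 * x @ A @ x + 0.01 * (x @ x) ** 2)
```

    The paper's contribution concerns these dynamics in the infinite-dimensional limit, where they are described by closed DMFT equations rather than simulated directly.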

    Sur la dynamique des algorithmes du gradient dans les modèles plantés en haute dimension (On the dynamics of gradient algorithms in high-dimensional planted models)

    Optimization of high-dimensional non-convex models has always been a difficult and fascinating problem. Since our minds tend to apply notions that we experienced and naturally learned in low dimension, our intuition is often led astray. These problems appear naturally and become more and more relevant, in particular in an era where an increasingly large amount of data is available. Most of the information that we receive is useless, and identifying what is relevant is an intricate problem. Machine learning problems and inference problems often fall into this setting. In both cases we have a cost function that depends on a large number of parameters to be optimized. A rather simple, but common, choice is the use of local algorithms based on the gradient, which descend in the cost function trying to identify good solutions. If the cost function is convex, then under mild conditions on the descent rate we are guaranteed to find the good solution. However, we often do not have convex costs. Understanding what happens in the dynamics of these non-convex high-dimensional problems is the main goal of this project.
    In this thesis we range from Bayesian inference to machine learning in an attempt to build a theory that describes how the algorithmic dynamics evolve and when they are doomed to fail. Prototypical models of machine learning and inference problems are intimately related. Another interesting connection, known for a long time, is the link between inference problems and the disordered systems studied by statistical physicists. The techniques and results developed there form the true skeleton of this work.
    In this dissertation we characterize the algorithmic limits of gradient descent and Langevin dynamics. We analyze the structure of the landscape and find the counter-intuitive result that, in general, an exponential number of spurious solutions does not prevent vanilla gradient descent initialized at random from finding the only good solution. Finally, we build a theory that explains, quantitatively and qualitatively, the underlying phenomenon.
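    Langevin dynamics, the second algorithm whose limits the dissertation characterizes, is gradient descent plus a thermal noise term. Below is a minimal sketch of its standard Euler-Maruyama discretization, on an assumed double-well loss with illustrative step size and temperature.

```python
import numpy as np

def langevin(grad, x0, dt=1e-3, T=0.2, steps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(steps):
        # gradient step plus a Gaussian kick of variance 2*T*dt
        x += -dt * grad(x) + np.sqrt(2 * T * dt) * rng.standard_normal(x.shape)
    return x

grad = lambda x: 4 * x * (x ** 2 - 1)     # gradient of the double well (x^2 - 1)^2
print(langevin(grad, x0=[1.0]))           # at low T the walker rarely hops between wells
```

    As T -> 0 this reduces to the gradient-flow limit; at finite T the noise lets the dynamics cross barriers on activated timescales.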